Technical Assessment

This file contains the data analysis as requested. I also developed a Dashboard in order to better display and interact with Outvio’s databases.

Minimum requirements

This section presents the resolution of the minimum requirements of this task.

1. Average delivery time per courier

DeuschePost presents the highest deliveryTime, followed by transaher and fedex.

shipments %>% 
  filter(!is.na(deliverDate)) %>% # Remove not delivered orders
  group_by(courier) %>%
  summarise(mean = mean(deliveryTime)) %>%
  arrange(desc(mean))

2. Average delivery time per shipping method

“dhl express - gpt - priority (packet tracked)” presents the highest deliveryTime.

shipments %>% 
  filter(!is.na(deliverDate)) %>% # Remove not delivered orders
  group_by(courier) %>%
  summarise(mean = mean(deliveryTime)) %>%
  arrange(desc(mean))

3. Average products per order

shipments_aux <- shipments %>%
  separate_rows(packages) %>% 
  filter(!packages %in% c('oid', ':', '')) # Unnest and create one row per package

packages_aux <- packages %>% 
  separate_rows(products) %>%
  filter(!products %in% c('oid', ':', '')) # Unnest and create on row per product

products_per_order <- shipments_aux %>%
  inner_join(packages_aux, by = c('packages' = 'X_id')) %>% # Join Shipments and Packages data
  group_by(X_id) %>%
  summarise(products = n_distinct(products)) %>% # Count # of products whithin each order
  ungroup() %>%
  summarise(products_per_order = mean(products))

c('In average, each order presents 2.85 products.')
## [1] "In average, each order presents 2.85 products."

Aditional Analysis

In order to better explore the provided data bases, I formulate some questions to be answered through data analysis.

2. How accurate is the delivery prediction?

Delayed vs On Time orderds

Comparing the effective deliver time with the predicted, 16% of the orders have been delayed through the analyze time period.

shipments %>% 
  filter(!is.na(delayed)) %>% # Remove not delivered orders
  group_by(delayed) %>%
  summarise(count = n_distinct(X_id)) %>%
  ungroup() %>%
  mutate(perc = count/sum(count)) %>% # Criando a visão percentual
  plot_ly(y = ~perc, 
          x = ~delayed, 
          type = 'bar') %>%
  layout(barmode = 'stack')

Delayed vs On Time orders per Courier

The plot is comparing On Time delivered orders versus Delayed orders per courier. Envialia presents a difficult situation, since it is the most popular courier and, regarding the main couriers, is the one that presents more delayed orders.

shipments %>% 
  filter(!is.na(delayed)) %>% # Remove not delivered orders
  group_by(delayed, courier) %>%
  summarise(count = n_distinct(X_id),
            predictedDiff = mean(predictedDiff)) %>%
  ungroup() %>%
  group_by(courier) %>%
  mutate(perc = count/sum(count)) %>% # Criando a visão percentual
  ungroup() %>%
  plot_ly(y = ~count, 
          x = ~reorder(courier, desc(count)),
          color = ~delayed,
          type = 'bar') %>%
  layout(barmode = 'stack')

Delayed vs On Time orders per Courier (Percentage)

The plot is comparing On Time delivered orders versus Delayed orders per courier. Envialia presents a difficult situation, since it is the most popular courier and, regarding the main couriers, is the one that presents more delayed orders.

shipments %>% 
  filter(!is.na(delayed)) %>% # Remove not delivered orders
  group_by(delayed, courier) %>%
  summarise(count = n_distinct(X_id),
            predictedDiff = mean(predictedDiff)) %>%
  ungroup() %>%
  group_by(courier) %>%
  mutate(perc = count/sum(count)) %>% # Criando a visão percentual
  ungroup() %>%
  plot_ly(y = ~perc, 
          x = ~reorder(courier, desc(count)),
          color = ~delayed,
          type = 'bar') %>%
  layout(barmode = 'stack')

How long the delyed orders took to arrive?

60% of the delayed orders were delayed by only one day. Almost 90% of the delayed orders were delayed by a maximum of 5 days.

shipments %>% 
  filter(delayed == 'Delayed') %>% # Remove not delivered orders
  group_by(predictedDiff) %>%
  summarise(count = n_distinct(X_id)) %>%
  ungroup() %>%
  mutate(perc = count/sum(count)) %>% # Criando a visão percentual
  plot_ly(x = ~predictedDiff, 
          y = ~perc, 
          type = 'bar')

3. How can we compare couriers regarding performance?

One option to estimate which are the top performers courier is to look for the orders volume of delivers versus delivery Time.

shipments %>%
  filter(!is.na(deliverDate)) %>% # Remove not delivered orders
  group_by(courier) %>%
  summarise(volume = n_distinct(X_id),
            deliveryTime = mean(deliveryTime)) %>%
  plot_ly(x = ~deliveryTime,
         y = ~volume,
         text = ~courier)
shipments %>%
  filter(!is.na(deliverDate) & delayed == 'On Time') %>% # Remove not delivered orders
  group_by(courier) %>%
  summarise(volume = n_distinct(X_id),
            deliveryTime = mean(deliveryTime)) %>%
  plot_ly(x = ~deliveryTime,
         y = ~volume,
         text = ~courier)

4. How is the orders behavior during the time? Is there any sazonality?

shipments_aux <- shipments %>%
  separate_rows(packages) %>% 
  filter(!packages %in% c('oid', ':', '')) # Unnest and create one row per package

packages_aux <- packages %>% 
  separate_rows(products) %>%
  filter(!products %in% c('oid', ':', '')) # Unnest and create on row per product

shipments_aux %>%
  inner_join(packages_aux, by = c('packages' = 'X_id')) %>% # Join Shipments and Packages data
  group_by(X_id) %>%
  summarise(products = n_distinct(products)) %>% # Count # of products whithin each order
  ungroup() %>%
  summarise(products_per_order = mean(products))